Building a Large Automatically Parsed Corpus of Finnish

Authors

  • Filip Ginter
  • Jenna Nyblom
  • Veronika Laippala
  • Samuel Kohonen
  • Katri Haverinen
  • Simo Vihjanen
  • Tapio Salakoski
Abstract

We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both morphological and syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.

Similar resources

Parsed Corpora for Linguistics

Knowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically. We argue that such parsed corpora are an interesting new resource for linguists. The argument is illustrated by means of a number of recent results which were established with the help of parsed corpora.

DCG Induction using MDL and Parsed

We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon estimation. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best r...

Dependencies vs. Constituents for Tree-Based Alignment

Given a parallel parsed corpus, statistical tree-to-tree alignment attempts to match nodes in the syntactic trees for a given sentence in two languages. We train a probabilistic tree transduction model on a large automatically parsed Chinese-English corpus, and evaluate results against human-annotated word-level alignments. We find that a constituent-based model performs better than a similar pr...

Robust VPE Detection using Automatically Parsed Text

This paper describes a Verb Phrase Ellipsis (VPE) detection system, built for robustness, accuracy and domain independence. The system is corpus-based, and uses machine learning techniques on free text that has been automatically parsed. Tested on a mixed corpus comprising a range of genres, the system achieves a 70% F1-score. This system is designed as the first stage of a complete VPE resolut...

Huge Parsed Corpora in LASSY

One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...

Publication date: 2013